The goal of this project is to build a model to learn from object movement tracking data to predict how well barbell lifting is performed. Simple tree models are tried but do not perform well; they can achieve only 50% - 53% accuracy. Random forest model works very well and yields 99.54% accuracy when applied on hold-out sample set. The derived features of Euler angles provide significant predicting power to the model.
The goal of this project is to build a model to predict how well barbell lifting is performed. 6 participants were asked to perform weight lifting in 5 different ways: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). We will use data collected from movement tracking devices (put on the belt, forearm, arm, and dumbbell) to classify/predict in which way the participant is performing.
The dataset is generously provided by the authors of the paper “Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013.” Read more information here (see the section on the Weight Lifting Exercise Dataset): http://groupware.les.inf.puc-rio.br/har#ixzz4SnGSJIzG.
The movement class (A,B,C,D,E) is recorded in column ‘classe’ in the dataset. Most of the columns are different type of movement readings from the devices. There are also columns for participant names, timestamps, and row numbers.
# loading required packages
library(caret); library(data.table); library(rpart.plot)
library(party); library(rattle); library(parallel); library(doSNOW)
# set working directory
setwd('C:\\Users\\hsinhung\\Documents\\Coursera\\Practical_ML_proj')
# set up parallel processing to speed up model training
cl <- makeCluster(2, type='SOCK')
registerDoSNOW( cl )
# Read training dataset into data.table
dt_ori <- data.table(read.table(file='pml-training.csv',sep=',',header=T))
Because there are sufficient amount of data (19622 observations), we decide to split data into training and test/validation sets: 80% observations going to training and 20% going to testing.
dim(dt_ori)
## [1] 19622 160
set.seed(3456)
trainIndex <- createDataPartition(dt_ori$classe, p = .8, list = FALSE, times = 1)
dt_train <- dt_ori[ trainIndex,] # training samples
dt_test <- dt_ori[ -trainIndex,] # hold-out testing samples
At a glance, this is a high-dimensional dataset with 160 variables (including the target column ‘classe’ ):
dim(dt_train)
## [1] 15699 160
The authors of the dataset described in their paper http://groupware.les.inf.puc-rio.br/public/papers/2013.Velloso.QAR-WLE.pdf that additional derived features were computed using sliding window approach. In the dataset column names with these prefixes were derived sliding-window features ‘avg_’, ‘var_’, ‘stddev_’, ‘max_’, ‘min_’, ‘amplitude_’, ‘kurtosis_’, ‘skewness_’. These sliding window aggregate features are not suitable to be used in our project because their values are missing in most of the rows in the data. Therefore we will eliminate these sliding-window features from the model.
We will also eliminate columns for row numbers and participant names. Timestamp and time window columns will also be dropped because they relate to sliding window features which were not needed in this study.
# Removing unused columns for derived sliding-window values
dt_train <- dt_train[,colnames(dt_train)[!grepl(pattern = '(avg_)|(var_)|(stddev_)|(max_)|(min_)|(amplitude_)|(kurtosis_)|(skewness_)', x = colnames(dt_train))],with=FALSE]
# Removing unused columns for participant names and timestamps
dt_train <- dt_train[,colnames(dt_train)[!grepl(pattern = '(X)|(user_name)|(timestamp)|(window)', x = colnames(dt_train))],with=FALSE]
After the cleaning, there are now more manageable 53 columns left (including the target ‘classe’ column):
dim(dt_train)
## [1] 15699 53
The authors of the dataset described in their paper http://groupware.les.inf.puc-rio.br/public/papers/2013.Velloso.QAR-WLE.pdf that they calculated features on the Euler angles (roll, pitch, and yaw) for each device location (dumbbell, arm, forearm, and belt). These calculated features represented some levels of aggregated information from all raw readings. We are interested in learning if the interaction between these aggregated fields yield any useful pattern. To reduce the complexity of the exploratory plot we pick only rows for classe ‘A’ and ‘D’ and do a pair scatter plot on these 12 Euler angle fields.
# Exploratory Plotting
library(AppliedPredictiveModeling)
classe_to_plot = c('A','D')
col_to_plot = colnames(dt_train)[grepl(x = colnames(dt_train),pattern = '(roll)|(pitch)|(yaw)')]
dtplotx = dt_train[classe %in% classe_to_plot, col_to_plot, with=FALSE]
dtploty = dt_train[classe %in% classe_to_plot,]$classe
transparentTheme(trans = .05)
featurePlot(x = dtplotx, y = dtploty, plot = 'pairs', auto.key = list(columns = 5))
The result is fairly interesting. The interactions between some fields seem to be able to differentiate class ‘A’ (red dots) from ‘D’ (black dots). In many plots class ‘A’ dots scattered much more widely than class ‘D’ ones, and class ‘D’ seems to have only positive values in ‘pitch_forearm’ while class ‘A’ took both negative and positive values. These identifiable patterns revealed by simple exploratory analysis gives some confidence that more sophisticated models should be able to pick up more complex patterns to achieve better predictions.
We first fit a simple decision tree to further explore if there exist any more easily identifiable patterns. Repeated cross-validation (10-fold repeated 10 times) is used to estimate model performance.
set.seed(1001)
# set up repeated cross-validation to find best model parameters:
train_control <- trainControl(method='repeatedcv', number=10, repeats=10, savePredictions = TRUE)
# fit a rpart model
rpart_fit <- train(classe~.,method="rpart",data=dt_train,trControl=train_control)
This simple rpart tree model doesn’t perform well. The best model parameter cp (complexity parameter) 0.036 yields unsatisfactory accuracy of 50.7%.
rpart_fit
## CART
##
## 15699 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 14128, 14130, 14128, 14128, 14129, 14131, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03266578 0.5072815 0.3566945
## 0.06008011 0.4264729 0.2264811
## 0.11508678 0.3361782 0.0791564
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03266578.
And from the tree plot it is clear that except for final node (3) and (4) in which class A and E are cleanly identified, all other nodes do not have dominating classes. Note that class D is not even identified by any nodes.
fancyRpartPlot(rpart_fit$finalModel)
Many models tend to perform worse when highly correlated features exist in the dataset. We try to identify highly correlated fields using 0.75 as the coefficient ratio cutoff. 20 fields were recommended to be removed:
# Removing highly correlated variables
correlationMatrix <- cor(dt_train[,1:52,with=FALSE])
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.75, names=TRUE)
highlyCorrelated
## [1] "accel_belt_z" "roll_belt" "accel_belt_y"
## [4] "total_accel_belt" "accel_dumbbell_z" "accel_belt_x"
## [7] "pitch_belt" "magnet_dumbbell_x" "accel_dumbbell_y"
## [10] "magnet_dumbbell_y" "accel_arm_x" "accel_dumbbell_x"
## [13] "accel_arm_z" "magnet_arm_y" "magnet_belt_z"
## [16] "accel_forearm_y" "gyros_forearm_y" "gyros_dumbbell_x"
## [19] "gyros_dumbbell_z" "gyros_arm_x"
We then remove these 20 highly correlated fields from the dataset and fit a tree model again.
set.seed(1002)
# removing highly correlated fields
dt_train_condensed <- dt_train[,colnames(dt_train)[!(colnames(dt_train) %in% highlyCorrelated)],with=FALSE]
# set up repeated cross-validation to find best model parameters:
train_control <- trainControl(method='repeatedcv', number=10, repeats=10, savePredictions = TRUE)
# fit a rpart model
rpart_fit2 <- train(classe~.,method="rpart",data=dt_train_condensed,trControl=train_control)
With best model parameter cp (complexity parameter) 0.028 this new attempt yields accuracy of 53.7%. While some improvement has been achieved over last run, the result is still far from satisfactory.
rpart_fit2
## CART
##
## 15699 samples
## 32 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 14129, 14128, 14131, 14130, 14129, 14130, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.02785937 0.5372333 0.4182954
## 0.03035158 0.5139375 0.3898485
## 0.04348020 0.3981424 0.1961958
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02785937.
This new tree model is able to predict all 5 classes, but it seems to struggle to correctly classify classe ‘B’,‘C’, and ‘D’.
fancyRpartPlot(rpart_fit2$finalModel)
Now we will fit a Random Forest model to hopefully see more predicting power. We will continue to use the condensed dataset (after removing the highly correlated variables) here because it shows some success in improving the performance of simple rpart tree model. 10-fold cross-validation is chosen to estimate the model accuracy as well as to select best parameter for the model.
set.seed(1003)
train_control <- trainControl(method='cv', number=10, savePredictions = TRUE)
rf_fit <- train(classe~.,method="rf",data=dt_train_condensed,trControl=train_control)
save(rf_fit,file='robj_rf_fit')
The estimated accuracy of the random forest model is surprisingly high: 99.15% with mtry = 2 (randomly select 2 variables for each tree):
#load('robj_rf_fit')
rf_fit
## Random Forest
##
## 15699 samples
## 32 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 14127, 14129, 14129, 14129, 14129, 14131, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9914639 0.9892011
## 17 0.9910814 0.9887182
## 32 0.9803177 0.9751039
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
When examining the importance of variables in the model, we found 6 out of the top 10 most important variables are the derived aggregated features of Euler angles (roll_, pitch_, and yaw_ variables). This is consistent with our exploratory analysis in which identifiable patterns that differentiate classes are seen from the interactions of these Euler angle aggregates.
plot(varImp(rf_fit))
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.3.2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
We also look at the out-of-bag error estimate produced by random forest algorithm. It is very low: 0.65% indicating accuracy of 99.35%. This is slightly more optimistic than the estimate from cross-validation, but these 2 accuracy estimates are close and consistent.
rf_fit$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.65%
## Confusion matrix:
## A B C D E class.error
## A 4461 2 0 0 1 0.000672043
## B 13 3017 7 0 1 0.006912442
## C 0 20 2707 11 0 0.011322133
## D 0 0 41 2530 2 0.016712009
## E 0 0 0 4 2882 0.001386001
We will now use the hold-out dataset (dt_test) to test the performance of the random forest model. The test dataset will first be pre-processed in the same way training set went through:
# Removing unused columns for derived sliding-window values
dt_test <- dt_test[,colnames(dt_test)[!grepl(pattern = '(avg_)|(var_)|(stddev_)|(max_)|(min_)|(amplitude_)|(kurtosis_)|(skewness_)', x = colnames(dt_test))],with=FALSE]
# Removing unused columns for paticipant names and timestamps
dt_test <- dt_test[,colnames(dt_test)[!grepl(pattern = '(X)|(user_name)|(timestamp)|(window)', x = colnames(dt_test))],with=FALSE]
Making predictions and examining performance. The prediction accuracy on hold-out samples is very satisfactory: 99.54%.
rf_predict <- predict(rf_fit,newdata = dt_test)
confusionMatrix( rf_predict, dt_test$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1116 1 0 0 0
## B 0 756 2 0 0
## C 0 2 680 10 0
## D 0 0 2 632 0
## E 0 0 0 1 721
##
## Overall Statistics
##
## Accuracy : 0.9954
## 95% CI : (0.9928, 0.9973)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9942
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9960 0.9942 0.9829 1.0000
## Specificity 0.9996 0.9994 0.9963 0.9994 0.9997
## Pos Pred Value 0.9991 0.9974 0.9827 0.9968 0.9986
## Neg Pred Value 1.0000 0.9991 0.9988 0.9967 1.0000
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1927 0.1733 0.1611 0.1838
## Detection Prevalence 0.2847 0.1932 0.1764 0.1616 0.1840
## Balanced Accuracy 0.9998 0.9977 0.9952 0.9911 0.9998
Random forest model works very well on this body-movement datasets. Tested by hold-out samples, the model’s prediction accuracy reaches 99.54%, slightly higher than both the estimation from cross-validation (99.15%) and out-of-bag estimate (99.35%). The high hold-out sample prediction accuracy supports the good performance of the model. The consistent accuracy rates among hold-out sample predictions, cross-validation estimate, and out-of-bag accuracy estimates also show that cross-validation and out-of-bag accuracy estimates can reasonably predict the performance of models on unseen data.